# Import packages
import pandas as pd
import yfinance as yf
import pandas_datareader as pdr
L15: Linear regression intro
Preliminaries
Load Fama-French factor data:
ff3f = pdr.DataReader('F-F_Research_Data_Factors', 'famafrench', '2012-01-01')[0]/100
ff3f.head(2)
| Date | Mkt-RF | SMB | HML | RF |
|---|---|---|---|---|
| 2012-01 | 0.0505 | 0.0203 | -0.0097 | 0.0 |
| 2012-02 | 0.0442 | -0.0185 | 0.0043 | 0.0 |
Download monthly prices (keep only Adjusted Close prices):
firm_prices = yf.download('TSLA', '2012-12-01', '2020-12-31', interval = '1mo')['Adj Close'].dropna().to_frame()
firm_prices.head(2)
[*********************100%***********************] 1 of 1 completed
| Date | Adj Close |
|---|---|
| 2012-12-01 | 2.258000 |
| 2013-01-01 | 2.500667 |
Calculate monthly returns, drop missing and rename “Adj Close” to “TSLA”:
firm_ret = firm_prices.pct_change().dropna().rename(columns = {'Adj Close': 'TSLA'})
firm_ret.head(2)
| Date | TSLA |
|---|---|
| 2013-01-01 | 0.107470 |
| 2013-02-01 | -0.071448 |
We need to merge firm_ret with ff3f, but note that their dates look different. Check their formats first:
firm_ret.index.dtype
dtype('<M8[ns]')
ff3f.index.dtype
period[M]
Convert the index of firm_ret to a monthly period, to match the date format in ff3f:
firm_ret.index = firm_ret.index.to_period('M')
firm_ret.head(2)
| Date | TSLA |
|---|---|
| 2013-01 | 0.107470 |
| 2013-02 | -0.071448 |
Merge the two datasets:
data = firm_ret.join(ff3f)
data.head(2)
| Date | TSLA | Mkt-RF | SMB | HML | RF |
|---|---|---|---|---|---|
| 2013-01 | 0.107470 | 0.0557 | 0.0033 | 0.0096 | 0.0 |
| 2013-02 | -0.071448 | 0.0129 | -0.0028 | 0.0011 | 0.0 |
# Add a column of ones that will serve as the regression intercept (constant) term
data['const'] = 1
data
| Date | TSLA | Mkt-RF | SMB | HML | RF | const |
|---|---|---|---|---|---|---|
| 2013-01 | 0.107470 | 0.0557 | 0.0033 | 0.0096 | 0.0000 | 1 |
| 2013-02 | -0.071448 | 0.0129 | -0.0028 | 0.0011 | 0.0000 | 1 |
| 2013-03 | 0.087855 | 0.0403 | 0.0081 | -0.0019 | 0.0000 | 1 |
| 2013-04 | 0.424914 | 0.0155 | -0.0236 | 0.0045 | 0.0000 | 1 |
| 2013-05 | 0.810706 | 0.0280 | 0.0173 | 0.0263 | 0.0000 | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 2020-08 | 0.741452 | 0.0763 | -0.0022 | -0.0296 | 0.0001 | 1 |
| 2020-09 | -0.139087 | -0.0363 | 0.0004 | -0.0268 | 0.0001 | 1 |
| 2020-10 | -0.095499 | -0.0210 | 0.0437 | 0.0422 | 0.0001 | 1 |
| 2020-11 | 0.462736 | 0.1247 | 0.0581 | 0.0213 | 0.0001 | 1 |
| 2020-12 | 0.243252 | 0.0463 | 0.0489 | -0.0150 | 0.0001 | 1 |

96 rows × 6 columns
Linear regression basics
A linear regression is a statistical model, which means it is a set of assumptions about the relation between two or more variables. In particular, the standard linear regression assumptions are (we restrict ourselves to two variables X and Y for now):
A1. Linearity
The relation between the variables is assumed to be linear in parameters:
\[Y_t = \alpha + \beta \cdot X_t + \epsilon_t \]
Note that “linear in parameters” means the function that describes the relation between X and Y (the equation above) is linear with respect to \(\alpha\) and \(\beta\) (e.g. \(Y = \alpha \cdot X^{\beta} + \epsilon\) is not linear in parameters). It does not mean that the relation needs to be linear with respect to X (e.g. \(Y = \alpha + \beta \cdot X^2 + \epsilon\) is still linear in parameters).
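As a quick illustration (a minimal sketch with simulated data, not part of the lecture data), a model that is non-linear in X but linear in parameters can still be estimated by least squares after transforming the regressor:

# A minimal sketch with simulated data: the model Y = alpha + beta * X^2 + eps
# is non-linear in X but linear in parameters, so it can be estimated by
# least squares after transforming the regressor.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 + 2.0 * x**2 + rng.normal(size=200)

X = np.column_stack([np.ones_like(x), x**2])   # constant column and X^2
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)                                    # approximately [0.5, 2.0]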
Before we cover the remaining assumptions, a bit of terminology:
Y is commonly referred to as the “dependent”, or “explained” or “endogenous” variable
X is commonly referred to as the “independent”, or “explanatory”, or “exogenous” variable (though remember that X can stand for more than one variable)
\(\epsilon\) is commonly referred to as the “residual” of the regression, or the “error” term, or the “disturbance” term
\(\alpha\) (alpha) and \(\beta\) (beta) are the “coefficients” or “parameters” of the regression. The outcome of “running a regression” is to calculate estimates for these alpha and beta coefficients.
the t subscript is meant to represent the fact that we observe multiple realizations of the X and Y variables, and the linear relation is assumed to hold for each set of realizations (different t’s can represent different points in time, or different firms, different countries, etc). Going forward in this section, we will assume that t stands for time, to make the interpretation clearer.
A2. Mean independence
This assumption states that the independent variable(s) X convey no information about the disturbance terms (\(\epsilon\)’s). Technically, we write this assumption as:
\[E[\epsilon_t | X] = 0\]
This is also called the “strict exogeneity” assumption. When this condition is not satisfied, we say that our regression model has an endogeneity problem.
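To see what this assumption buys us: by the law of iterated expectations, mean independence implies that the residuals have mean zero and are uncorrelated with the explanatory variable(s) at all points in time:

\[E[\epsilon_t] = E\big[E[\epsilon_t | X]\big] = 0 \qquad \text{and} \qquad Cov(X_s, \epsilon_t) = 0 \ \text{for all } s, t\]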
A4. Full rank
This assumption states that there are no exact linear relationships between the explanatory variables X (when there are two or more such variables). When this assumption is not satisfied, we say we have a multicollinearity problem.
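As a minimal sketch (with made-up numbers, not the lecture data), an exact linear relationship between regressors makes the design matrix rank-deficient, so the OLS coefficients are no longer uniquely identified:

# Hypothetical example of perfect multicollinearity: x2 is exactly 2 * x1
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = 2 * x1
X  = np.column_stack([np.ones_like(x1), x1, x2])

# The design matrix has 3 columns but only rank 2, so X'X is singular
print(np.linalg.matrix_rank(X))   # prints 2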
In the next few lectures, we will cover strategies that we can use when some of the above assumptions are not satisfied.
Regression fitting: ordinary least squares (OLS)
By far the most common method for estimating linear regression coefficients is by minimizing the sum of the squares of the error terms (hence “least squares”).
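In the two-variable model from assumption A1, this means choosing the estimates \(\hat{\alpha}\) and \(\hat{\beta}\) that minimize \(\sum_t (Y_t - \alpha - \beta X_t)^2\), which yields the familiar closed-form solution:

\[\hat{\beta} = \frac{Cov(X, Y)}{Var(X)}, \qquad \hat{\alpha} = \bar{Y} - \hat{\beta} \bar{X}\]

where \(Cov\), \(Var\), \(\bar{X}\) and \(\bar{Y}\) denote sample moments.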
The package we will use for linear regression fitting is called “statsmodels”. Install this package by typing the following in a terminal (or Anaconda Prompt):
pip install statsmodels
In fact, for the most part, we will only use the “api” subpackage of “statsmodels” as below. Here is the official documentation for the package if you want to learn more about it:
https://www.statsmodels.org/stable/index.html
import statsmodels.api as sm
As mentioned above, we will estimate our regression coefficients using OLS (ordinary least squares). This can be done with the OLS function of statsmodels:
Syntax:
class statsmodels.regression.linear_model.OLS(endog, exog=None, missing='none', hasconst=None, **kwargs)
When we use this function, we can replace statsmodels.regression.linear_model with sm (as imported above). The endog parameter is where we specify the data for our dependent variable, and the exog parameter is where we specify our independent variables. We usually set missing='drop' to tell Python that we want to get rid of any rows in our regression data that have missing values.
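As a quick sketch of this call pattern (using the data DataFrame built above; the full worked example follows in the next section), note that statsmodels also provides sm.add_constant() as an alternative to creating the 'const' column by hand:

# Sketch of the typical OLS call; sm.add_constant() adds the column of ones
# for the intercept (equivalent to the 'const' column we created earlier)
y = data['TSLA'] - data['RF']               # dependent variable (excess return)
X = sm.add_constant(data[['Mkt-RF']])       # regressors plus a constant column
res_sketch = sm.OLS(endog=y, exog=X, missing='drop').fit()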
Example 1: estimating a stock’s alpha and beta using the market model
The market model (aka the “single-factor model” or the “single-index model”) is a linear regression model that relates the excess return on a stock to the excess returns on the market portfolio:
\[R_{i,t} - R_{f,t} = \alpha_i + \beta_i (R_{m,t} - R_{f,t}) + \epsilon_{i,t}\]
where:

- \(R_{i,t}\) is the return of firm \(i\) at time \(t\)
- \(R_{m,t}\) is the return of the market at time \(t\) (we generally use the S&P 500 index as the market portfolio)
- \(R_{f,t}\) is the risk-free rate at time \(t\) (most commonly the yield on the 1-month T-bill)
Below, we estimate this model for TSLA, using the data we gathered at the top of these lecture notes:
To “run” (i.e. “fit” or “estimate”) the regression, we use the .fit() function, which can be applied after the sm.OLS() function. We store the results in “res”:
res = sm.OLS(endog = data['TSLA']-data['RF'],
             exog = data[['const','Mkt-RF']],
             missing = 'drop'
             ).fit()
res
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7fb7bb37f310>
The above shows that res is a “RegressionResultsWrapper”. We have not seen this kind of object before. Check all the attributes of the results (res) object:
print(dir(res))
['HC0_se', 'HC1_se', 'HC2_se', 'HC3_se', '_HCCM', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abat_diagonal', '_cache', '_data_attr', '_data_in_cache', '_get_robustcov_results', '_is_nested', '_use_t', '_wexog_singular_values', 'aic', 'bic', 'bse', 'centered_tss', 'compare_f_test', 'compare_lm_test', 'compare_lr_test', 'condition_number', 'conf_int', 'conf_int_el', 'cov_HC0', 'cov_HC1', 'cov_HC2', 'cov_HC3', 'cov_kwds', 'cov_params', 'cov_type', 'df_model', 'df_resid', 'eigenvals', 'el_test', 'ess', 'f_pvalue', 'f_test', 'fittedvalues', 'fvalue', 'get_influence', 'get_prediction', 'get_robustcov_results', 'info_criteria', 'initialize', 'k_constant', 'llf', 'load', 'model', 'mse_model', 'mse_resid', 'mse_total', 'nobs', 'normalized_cov_params', 'outlier_test', 'params', 'predict', 'pvalues', 'remove_data', 'resid', 'resid_pearson', 'rsquared', 'rsquared_adj', 'save', 'scale', 'ssr', 'summary', 'summary2', 't_test', 't_test_pairwise', 'tvalues', 'uncentered_tss', 'use_t', 'wald_test', 'wald_test_terms', 'wresid']
Particularly important results are stored in summary(), params, pvalues, tvalues, and rsquared. We’ll cover all of these below. As the name suggests, the summary() attribute contains a summary of the regression results:
print(res.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.155
Model: OLS Adj. R-squared: 0.146
Method: Least Squares F-statistic: 17.26
Date: Wed, 22 Mar 2023 Prob (F-statistic): 7.18e-05
Time: 07:32:31 Log-Likelihood: 30.144
No. Observations: 96 AIC: -56.29
Df Residuals: 94 BIC: -51.16
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.0416 0.019 2.186 0.031 0.004 0.079
Mkt-RF 1.8297 0.440 4.154 0.000 0.955 2.704
==============================================================================
Omnibus: 29.750 Durbin-Watson: 1.561
Prob(Omnibus): 0.000 Jarque-Bera (JB): 56.488
Skew: 1.227 Prob(JB): 5.42e-13
Kurtosis: 5.846 Cond. No. 24.2
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Example 2: estimating a stock’s alpha and beta(s) using the Fama-French three-factor model
The Fama-French three factor model is a linear regression model that relates the excess return on a stock to the excess returns on the market portfolio and the returns on the SMB (small minus big) and HML (high minus low book-to-market) factors:
\[R_{i,t} - R_{f,t} = \alpha_i + \beta_{i,m} (R_{m,t} - R_{f,t}) + \beta_{i,smb} R_{smb,t} + \beta_{i,hml} R_{hml,t} + \epsilon_{i,t}\]
Challenge:
Estimate this regression for TSLA using the data we gathered at the top of these lecture notes.
# Run the regression and print the results
res3 = sm.OLS(endog = data['TSLA']-data['RF'],
              exog = data[['const','Mkt-RF','SMB','HML']],
              missing = 'drop'
              ).fit()
print(res3.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.164
Model: OLS Adj. R-squared: 0.137
Method: Least Squares F-statistic: 6.036
Date: Wed, 22 Mar 2023 Prob (F-statistic): 0.000847
Time: 07:32:31 Log-Likelihood: 30.677
No. Observations: 96 AIC: -53.35
Df Residuals: 92 BIC: -43.10
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.0371 0.020 1.872 0.064 -0.002 0.076
Mkt-RF 1.8810 0.480 3.915 0.000 0.927 2.835
SMB 0.1931 0.794 0.243 0.808 -1.384 1.770
HML -0.6717 0.667 -1.007 0.317 -1.997 0.653
==============================================================================
Omnibus: 31.346 Durbin-Watson: 1.587
Prob(Omnibus): 0.000 Jarque-Bera (JB): 62.467
Skew: 1.268 Prob(JB): 2.73e-14
Kurtosis: 6.030 Cond. No. 45.0
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Interpreting regression results
Coefficients
The “const” row contains information about the firm’s \(\alpha\) and the “Mkt-RF” row contains information about the firm’s \(\beta\). The \(\alpha\) and \(\beta\) coefficient estimates themselves are in the “coef” column (\(\alpha = 0.0416\), and \(\beta = 1.83\) in the single-factor model).
The “const” coefficient tells us what we should expect the return on TSLA to be in a month with no systematic shocks (i.e. a month with an excess market return of 0).
The ‘Mkt-RF’ coefficient tells us how we should expect the return on TSLA to react to a given shock to the market portfolio (e.g. the 1.83 coefficient tells us that, on average, if the market goes up by 1%, TSLA goes up by 1.83%, and when the market goes down by 1%, TSLA goes down by 1.83%)
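For example, plugging the estimates above into the fitted regression, the predicted excess return on TSLA in a month when the excess market return is 1% is roughly:

\[\hat{\alpha} + \hat{\beta} \times 0.01 = 0.0416 + 1.83 \times 0.01 \approx 0.060\]

i.e. about 6% for that month.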
The results object (res) stores the regression coefficients in its params attribute:
res.params
const 0.041642
Mkt-RF 1.829738
dtype: float64
Note that res.params is a Pandas Series, so we can access its individual elements using the index labels:
print('Alpha = ', res.params['const'])
print('Market Beta = ', res.params['Mkt-RF'])
Alpha = 0.041642456171143115
Market Beta = 1.8297380496913744
Challenge:
Print out (separately) the alpha and each of the betas from the three-factor model
print("three-factor alpha:", res3.params['const'])
print("three-factor market beta:", res3.params['Mkt-RF'])
print("SMB beta:", res3.params['SMB'])
print("HML beta:", res3.params['HML'])
three-factor alpha: 0.037089953287553816
three-factor market beta: 1.881000087482544
SMB beta: 0.19308158602997738
HML beta: -0.6717493189696122
Statistical significance
The p-values are in the “P > |t|” column. P-values lower than 0.05 allow us to conclude that the corresponding coefficient is statistically different from 0 at the 95% confidence level (i.e. reject the null hypothesis that the coefficient is 0). At the 99% confidence level, we would need the p-value to be smaller than 1% (1 minus the confidence level) to reject that null hypothesis.
The t-statistics for the two coefficients are in the “t” column. Loosely speaking, a t-statistic larger than 2 or smaller than -2 allows us to conclude that the corresponding coefficient is statistically different from 0 at the 95% confidence level (i.e. reject the null hypothesis that the coefficient is 0). In terms of statistical significance, the t-statistic does not provide any new information over the p-value.
The last two columns give us the 95% confidence interval for each coefficient.
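These intervals can also be retrieved directly from the results object with its conf_int() method (a short sketch; the two columns are the lower and upper bounds):

# 95% confidence intervals for the market-model coefficients
print(res.conf_int())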
For the market model, TSLA’s alpha has a p-value of 0.031, so we can conclude that its alpha is statistically significantly different from 0 at the 95% confidence level (but not at the 99% confidence level).
The fact that the alpha is positive and statistically different from 0 (at the 95% level) means that, based on the single-factor model, TSLA seems to be undervalued. A negative alpha would mean the stock is overvalued.
If we can not reject the null hypothesis that alpha is 0, the conclusion is NOT that alpha = 0 and therefore the stock is correctly valued (since we can never “accept” a null hypothesis, we can only fail to reject). The conclusion is that we do not have enough evidence to claim that the stock is either undervalued or overvalued (which is not the same thing as saying that we have enough evidence to claim that the stock is correctly valued).
The results object (res) stores the regression p-values in its pvalues attribute:
res.pvalues
const 0.031312
Mkt-RF 0.000072
dtype: float64
The p-values can be accessed individually:
print("Alpha p-value = ", res.pvalues['const'])
print("Beta p-value = ", res.pvalues['Mkt-RF'])
Alpha p-value = 0.031311923663591416
Beta p-value = 7.182372372221883e-05
T-statistics are stored in the tvalues attribute:
res.tvalues
const 2.185832
Mkt-RF 4.154374
dtype: float64
T-statistics of individual coefficients:
print("Alpha t-stat = ", res.tvalues['const'])
print("Beta t-stat = ", res.tvalues['Mkt-RF'])
Alpha t-stat = 2.1858319462947398
Beta t-stat = 4.154374466130978
Challenge:
Is TSLA mispriced (undervalued OR overvalued) at the 5% significance level with respect to the Fama-French 3-factor model?
print("TSLA mispriced? \n", res3.pvalues['const'] < 0.05)
TSLA mispriced?
False
Challenge:
Does TSLA have a significant exposure (at 5% level) to either of the 3 factors in the Fama-French model?
print('Significant market exposure?\n', res3.pvalues['Mkt-RF'] < 0.05)
print('Significant exposure to SMB?\n', res3.pvalues['SMB'] < 0.05)
print('Significant exposure to HML?\n', res3.pvalues['HML'] < 0.05)
Significant market exposure?
True
Significant exposure to SMB?
False
Significant exposure to HML?
False
The R-squared coefficient
The R-squared coefficient (top-right of the table, also referred to as the “coefficient of determination”) estimates the percentage of the total variance in the dependent variable (Y) that can be explained by the variance in the explanatory variable(s) (X).
In the context of our market-model example, the R-squared tells us the percentage of the firm’s total variance that is systematic in nature (i.e. non-diversifiable). The percentage of total variance that is idiosyncratic (diversifiable) equals 1 minus the R-squared.
The R-squared is stored in the rsquared attribute of the regression results object:
res.rsquared
0.1551232170825012
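This value can also be reproduced from the sums of squares stored in the results object (a sketch using the ssr and centered_tss attributes listed in dir(res) earlier):

# R-squared = 1 - (residual sum of squares) / (total centered sum of squares)
print(1 - res.ssr / res.centered_tss)   # should match res.rsquared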
Using the market model, our estimates of the percentage of total TSLA variance that is systematic vs idiosyncratic are:
print("Percent of total variance that is systematic: ", res.rsquared)
print("Percent of total variance that is idiosyncratic: ", 1 - res.rsquared)
Percent of total variance that is systematic: 0.1551232170825012
Percent of total variance that is idiosyncratic: 0.8448767829174988
Challenge:
What percentage of TSLA total variance can be diversified away under the Fama-French 3-factor model?
print("Percent of total variance that is systematic: ", res3.rsquared)
print("Percent of total variance that is idiosyncratic: ", 1 - res3.rsquared)
Percent of total variance that is systematic: 0.16445635788368973
Percent of total variance that is idiosyncratic: 0.8355436421163103
Diagnostics (bottom of the table):
The regression table reports (at the bottom) a few statistics that help us understand whether some of the assumptions of the linear regression model are not satisfied.
Durbin-Watson: tests for residual autocorrelation. Takes values in [0, 4]. Below 2 means positive autocorrelation; above 2 means negative autocorrelation. A value of 2 is ideal (no autocorrelation).
Omnibus: tests normality of residuals. Prob(Omnibus) close to 0 means reject normality
JB: another normality test (null is skew=0, kurt=3). Prob(JB) close to 0 means rejection of normality
Cond. No: tests for multicollinearity. Over 100 is worrisome, but we still have to look at correlations between variables to determine if any of them need to be dropped.
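If you need any of these diagnostics outside of the summary table, some of them can be recomputed directly from the residuals (a sketch using helper functions from statsmodels.stats.stattools):

# Recompute the Durbin-Watson and Jarque-Bera statistics from the residuals
from statsmodels.stats.stattools import durbin_watson, jarque_bera

print("Durbin-Watson:", durbin_watson(res.resid))
jb_stat, jb_pvalue, skew, kurt = jarque_bera(res.resid)
print("Jarque-Bera:", jb_stat, " p-value:", jb_pvalue)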